Fault-Management in P2P-MPI
Internal identifier: 003957 (Main/Exploration)
Authors: Stéphane Genaud [France]; Emmanuel Jeannot [France]; Choopan Rattanapoka [Thailand]
Source:
- International Journal of Parallel Programming [ISSN 0885-7458]; 2009-10-01.
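English descriptors
- KwdEn: Fault-tolerance, Grid computing, Middleware, Parallelism
- mix: Fault-tolerance, Grid computing, Middleware, Parallelism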
Abstract
This paper presents a study of fault management in a grid middleware, our home-grown software P2P-MPI. The framework is MPJ compliant, lets users execute message-passing parallel programs, and aims to support environments built from commodity hardware. Program execution in such environments is therefore failure prone, and particular attention must be paid to fault management. Fault management covers two issues: fault tolerance and fault detection. Fault tolerance deals with program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of program execution by the system, which is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is an evaluation of the failure probability of an application as a function of the replication degree. Because the failure probability depends on the execution length, we propose a model to evaluate the duration of a replicated parallel program. We then give an expression for the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. Our evaluation criteria are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.
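The relation between replication degree and failure probability described in the abstract can be illustrated with a minimal sketch. It is not the expression derived in the paper: it assumes that replicas fail independently with a constant failure rate `lam` (an exponential lifetime model), that the application is lost as soon as every replica of any one logical process has failed, and that the execution time `t` does not depend on the replication degree (the paper models that dependence; this sketch ignores it). The function names `failure_probability` and `min_replication_degree` are hypothetical.

```python
import math


def failure_probability(n, r, lam, t):
    """Probability that an application of n logical processes, each run as
    r replicas, fails before time t.

    Assumption (not from the source): independent replicas with constant
    failure rate lam, so one replica fails before t with 1 - exp(-lam * t).
    """
    p_replica = 1.0 - math.exp(-lam * t)  # a single replica fails before t
    p_process = p_replica ** r            # all r replicas of one process fail
    return 1.0 - (1.0 - p_process) ** n   # at least one logical process is lost


def min_replication_degree(n, lam, t, threshold, r_max=64):
    """Smallest r (up to r_max) keeping the failure probability at or below
    threshold, or None if no such r exists in that range."""
    for r in range(1, r_max + 1):
        if failure_probability(n, r, lam, t) <= threshold:
            return r
    return None


if __name__ == "__main__":
    # Illustrative values only: 64 processes, one hour, lam = 1e-5 per second.
    n, lam, t = 64, 1e-5, 3600.0
    for r in (1, 2, 3):
        print(r, failure_probability(n, r, lam, t))
    print("minimum r for a 1% threshold:", min_replication_degree(n, lam, t, 0.01))
```

Under these assumptions the failure probability falls quickly as `r` grows, which is why a small replication degree can be enough to stay under a given threshold. A linear search is used because the useful range of `r` is small.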
DOI: 10.1007/s10766-009-0115-8
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 001594
- to stream Istex, to step Curation: 001575
- to stream Istex, to step Checkpoint: 000A64
- to stream Hal, to step Corpus: 006C19
- to stream Hal, to step Curation: 006C19
- to stream Hal, to step Checkpoint: 002A06
- to stream Main, to step Merge: 003A35
- to stream Main, to step Curation: 003957
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Fault-Management in P2P-MPI</title>
<author><name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
</author>
<author><name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
</author>
<author><name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/s10766-009-0115-8</idno>
<idno type="url">https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001594</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001594</idno>
<idno type="wicri:Area/Istex/Curation">001575</idno>
<idno type="wicri:Area/Istex/Checkpoint">000A64</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000A64</idno>
<idno type="wicri:doubleKey">0885-7458:2009:Genaud S:fault:management:in</idno>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:inria-00425516</idno>
<idno type="url">https://hal.inria.fr/inria-00425516</idno>
<idno type="wicri:Area/Hal/Corpus">006C19</idno>
<idno type="wicri:Area/Hal/Curation">006C19</idno>
<idno type="wicri:Area/Hal/Checkpoint">002A06</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">002A06</idno>
<idno type="wicri:doubleKey">0885-7458:2009:Genaud S:fault:management:in</idno>
<idno type="wicri:Area/Main/Merge">003A35</idno>
<idno type="wicri:Area/Main/Curation">003957</idno>
<idno type="wicri:Area/Main/Exploration">003957</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Fault-Management in P2P-MPI</title>
<author><name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy</wicri:regionArea>
<placeName><region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy</wicri:regionArea>
<placeName><region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation wicri:level="1"><country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok</wicri:regionArea>
<wicri:noRegion>Bangkok</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Thaïlande</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">International Journal of Parallel Programming</title>
<title level="j" type="abbrev">Int J Parallel Prog</title>
<idno type="ISSN">0885-7458</idno>
<idno type="eISSN">1573-7640</idno>
<imprint><publisher>Springer US; http://www.springer-ny.com</publisher>
<pubPlace>Boston</pubPlace>
<date type="published" when="2009-10-01">2009-10-01</date>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="page" from="433">433</biblScope>
<biblScope unit="page" to="461">461</biblScope>
</imprint>
<idno type="ISSN">0885-7458</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0885-7458</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
<keywords scheme="mix" xml:lang="en"><term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: This paper presents a study of fault management in a grid middleware, our home-grown software P2P-MPI. The framework is MPJ compliant, lets users execute message-passing parallel programs, and aims to support environments built from commodity hardware. Program execution in such environments is therefore failure prone, and particular attention must be paid to fault management. Fault management covers two issues: fault tolerance and fault detection. Fault tolerance deals with program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of program execution by the system, which is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is an evaluation of the failure probability of an application as a function of the replication degree. Because the failure probability depends on the execution length, we propose a model to evaluate the duration of a replicated parallel program. We then give an expression for the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. Our evaluation criteria are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
<li>Thaïlande</li>
</country>
<region><li>Grand Est</li>
<li>Lorraine (région)</li>
</region>
<settlement><li>Vandœuvre-lès-Nancy</li>
</settlement>
</list>
<tree><country name="France"><region name="Grand Est"><name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
</region>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
</country>
<country name="Thaïlande"><noRegion><name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
</noRegion>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 003957 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 003957 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Lorraine |area= InforLorV4 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6 |texte= Fault-Management in P2P-MPI }}
This area was generated with Dilib version V0.6.33.